Clock-Delayed Domino for Dynamic Circuit Design 

Gin Yee and Carl Sechen 



Abstract — Clock-delayed (CD) domino is a self-timed dynamic logic 
family developed to provide single-rail gates with inverting or noninverting
outputs. CD domino is a complete logic family and is as easy to design
with as static CMOS circuits from a logic design and synthesis perspective.
Design tools developed for static CMOS are used as part of a methodology
for automating the design of CD domino circuits. The methodology and
CD domino's characteristics are demonstrated in the design of a 32-b
carry look-ahead adder. The adder was fabricated with MOSIS's 0.8-µm
CMOS process with scalable CMOS design rules that allow a 1.0-µm
drawn gate length. Measurements of the adder show a worst case addition
delay of 2.1 ns. The CD domino adder is 1.6× faster than a dual-rail domino
adder designed with the same cell library and technology.

Index Terms — Adder, dynamic logic circuit, dynamic logic clocking, in- 
verting single-rail dynamic gates, self-timed circuits. 



I. Introduction 

Dynamic circuits have become necessary for designing high-speed 
and compact circuits, as seen by their use in microprocessors [1], [2]. 
The reduced input capacitance and use of only nMOS logic transistors 
make domino circuits faster and smaller than their static counterparts 
[3]. One of dynamic logic's major shortcomings is its monotonic na- 
ture. This restriction causes domino logic, a widely used dynamic logic 
family, to allow only noninverting functions. However, current logic 
synthesis tools and most practical circuit designs require the flexibility 
to use inverting and noninverting functions in any combination. While
dynamic logic families with inverting outputs are known, they either
generate both polarities of the output in differential or dual-rail fashion
or use latches [4]-[6].

Clock-delayed domino provides any logic function through the use 
of a self-timed delay-matched clock tree for the precharge and eval- 
uation clock used in dynamic circuits. The delayed clocks are set to 
always arrive after the data inputs to a dynamic gate have settled. A de- 
layed clock is used in wave-domino logic for wave-pipelining, but was 
not considered for providing single-rail inverting or noninverting dy- 
namic logic gates [7]. Footless domino uses delayed clocks to reduce 
the through-current between power and ground [1]. Delayed evaluation
was used for a multiplier design, and delayed precharge was used
for reducing power consumption and noise during the precharge phase 
of domino circuits [8], [9]. Delay matching for self-timed circuits was 
used in an SRAM design [10]. 

A 32-b carry look-ahead adder (CLA) was designed using CD 
domino to demonstrate its design flexibility and speed advantages. 
The CLA logic equations were rewritten to take advantage of high 
fanin dynamic NOR and OR gates, and a dynamic XOR gate. 

The following section describes the key characteristics of CD
domino. Section III describes a design methodology and tools for
CD domino. Next, the design flexibility and speed advantages of
CD domino are demonstrated in the design of a 32-b CLA adder in
Section IV. This is followed by the adder's measurement results in
Section V and the conclusion.

Fig. 1. CD domino gate with a dynamic gate and delay element.

Fig. 2. CD domino path.

II. Clock-Delayed Domino Logic 

A. CD Domino Operation 

CD domino addresses the logic and circuit design difficulties of dy- 
namic circuits by providing any logic function without restrictions on 
how the gates are used together. This is accomplished through the use 
of self-timed delays for the precharge and evaluation clocks. Each CD 
domino gate consists of a dynamic gate and, if necessary, a delay element,
as shown in Fig. 1. It is similar to the self-timed scheme using
delay matching as defined in [11]. The self-timed clock output of the
delay element tells the next gate when the data output is ready, as shown
in Fig. 2. The delay set by the delay element must therefore always be
greater than the worst case delay of the dynamic gate, plus a margin.

It is critical that the clock output rising edge occurs after the gate 
output has switched and never before. The delay elements will always 
be the critical path, and the data hazard in dynamic logic caused by non- 
monotonic input transitions during the evaluation phase is prevented. 
This self-timed scheme for CD domino uses single-rail dynamic gates,
in contrast to the well-known dual-rail self-timed scheme
shown in Fig. 3 [12].

B. CD Domino Gates 

Noninverting CD domino gates are simply domino gates as described 
in [3]. Inverting CD domino gates can be designed by removing or 
adding an inverter to a domino gate, as shown by the second gate in 
Fig. 2. An inverting CD domino gate with its output precharged low is 
shown in Fig. 4. While not necessary, having the output precharged low
for all dynamic gates maintains the monotonic nature of dynamic gates
and prevents a precharge glitch possible with precharge high dynamic
gates. The delay element used in the inverting CD domino gate shown
in Fig. 4 is set to the delay of the nMOS pull-down network plus a
margin to provide correct operation.

Fig. 3. Dual-rail self-timed circuit.

Fig. 4. An inverting dynamic gate that precharges low.

C. Delay Element for Delay Matching 

A delay element, shown in Fig. 5, is used to create a delay chain for 
the delayed clocks used to precharge and evaluate the dynamic gates. 
The delay of the delay element and the rest of the delay circuit consists 
of four components matched to its corresponding dynamic gate: the in- 
trinsic gate delay, the output net wiring delay, the fanout gate load delay, 
and a margin. The gate delay is set equal to the worst case pull-down 
delay of the corresponding dynamic gate during the evaluation phase. 
Matching the gate delay can be done using a dummy domino gate. A
margin is added to account for setup times to the next gate; variations
in fabrication process, voltage, and temperature between
the delay element and its gate; and differences in the signal delay due to
output wiring, fanout load, and coupling parasitics. A margin of at least
20% of the gate delay was added to the delay elements for the adder circuit
to ensure proper precharge and evaluation of the CD domino gates.
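
The four delay components listed above can be combined as in the following minimal sketch. The class name `GateTiming`, the function `delay_element_target`, and the numeric values are hypothetical and chosen only for illustration; the only figure taken from the text is the margin of at least 20% of the gate delay.

```python
from dataclasses import dataclass


@dataclass
class GateTiming:
    """Worst case delay components of one dynamic gate, in ns (illustrative values)."""
    pull_down: float  # worst case pull-down delay during the evaluation phase
    wire: float       # output net wiring delay
    fanout: float     # delay contributed by the fanout gate load


def delay_element_target(t: GateTiming, margin_fraction: float = 0.20) -> float:
    """Delay the matched clock path should provide: the three matched
    components plus a margin expressed as a fraction of the gate delay."""
    if margin_fraction < 0:
        raise ValueError("margin_fraction must be non-negative")
    margin = margin_fraction * t.pull_down
    return t.pull_down + t.wire + t.fanout + margin


# Hypothetical gate: 0.30 ns pull-down, 0.05 ns wiring, 0.10 ns fanout load.
example = GateTiming(pull_down=0.30, wire=0.05, fanout=0.10)
print(f"target clock delay: {delay_element_target(example):.2f} ns")
```

In this sketch the margin scales with the intrinsic gate delay only, which is one reading of "20% of the gate delay"; a design flow could equally apply it to the full matched delay.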

D. Simplest CD Domino Clocking Scheme 

For CD domino circuits, two clocking schemes are possible. In the 
most basic scheme, only the slowest gate at each gate level needs to 
have a delay element. The clock output from the previous gate level 
would be used by the gates at the next level. This scheme, shown in 
Fig. 6, is similar to the wave-pipelining method used by Lien and 
Burleson for wave-domino [7]. 

The primary outputs provided by this clocking scheme must wait 
for the slowest gate at each gate level. Thus, the performance of the 
simplest CD domino clocking scheme is based on the slowest gate at 
each gate level and not necessarily the critical path. 

E. General CD Domino Clocking Scheme 

Fig. 5. Example adjustable delay element.

The second clocking scheme is based on a more general self-timed
delay tree obtained by having each gate use the clock output from
the gate of its slowest input, rather than the same clock for the entire
gate level. As Fig. 7 shows, this general approach shifts away from
the timing constraints in pipelining. At the cost of a few extra delay
elements, the outputs of the circuit evaluate faster because each gate
does not have to wait for the slowest gate's clock output from the previous
gate level, just the clock output from its slowest input. If a gate's
output is not the slowest input to any gate, then it does not need a delay
element.
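
The difference between the two clocking schemes can be illustrated with a small sketch that computes when each gate's evaluation clock arrives. The netlist, the delay values, and the function names are hypothetical; the model also assumes an acyclic netlist and a delay element equal to the gate delay plus a 20% margin, ignoring wiring and fanout.

```python
from typing import Dict, List, Tuple

# gate name -> (worst case gate delay in ns, input names).
# Names that are not keys of the netlist are treated as primary inputs.
Netlist = Dict[str, Tuple[float, List[str]]]

MARGIN = 0.20  # delay element margin as a fraction of the gate delay


def levelize(netlist: Netlist) -> Dict[str, int]:
    """Gate level = 1 + highest level among its inputs (primary inputs are 0)."""
    levels: Dict[str, int] = {}

    def level_of(name: str) -> int:
        if name not in netlist:
            return 0
        if name not in levels:
            levels[name] = 1 + max((level_of(i) for i in netlist[name][1]), default=0)
        return levels[name]

    for gate in netlist:
        level_of(gate)
    return levels


def simplest_clocks(netlist: Netlist) -> Dict[str, float]:
    """One clock per gate level, delayed by the slowest gate of each earlier level."""
    levels = levelize(netlist)
    slowest: Dict[int, float] = {}
    for gate, lvl in levels.items():
        slowest[lvl] = max(slowest.get(lvl, 0.0), netlist[gate][0])
    level_clock = {1: 0.0}
    for lvl in range(2, max(levels.values()) + 1):
        level_clock[lvl] = level_clock[lvl - 1] + (1 + MARGIN) * slowest[lvl - 1]
    return {gate: level_clock[lvl] for gate, lvl in levels.items()}


def general_clocks(netlist: Netlist) -> Dict[str, float]:
    """Each gate takes its clock from the delay element of its latest input gate."""
    levels = levelize(netlist)
    clock: Dict[str, float] = {}
    for gate in sorted(netlist, key=lambda g: levels[g]):
        clock[gate] = max(
            (clock[i] + (1 + MARGIN) * netlist[i][0]
             for i in netlist[gate][1] if i in netlist),
            default=0.0,
        )
    return clock


netlist: Netlist = {
    "g1": (1.0, ["a", "b"]),   # slow level-1 gate
    "g2": (0.2, ["c"]),        # fast level-1 gate
    "g3": (0.3, ["g2"]),       # level 2, fed only by the fast gate
    "g4": (0.4, ["g1", "g3"]), # level 3
}
simple = simplest_clocks(netlist)
general = general_clocks(netlist)
for gate in netlist:
    print(gate, round(simple[gate], 2), round(general[gate], 2))
```

In this example g3 is clocked much earlier under the general scheme because it no longer waits for g1, the slowest gate of the first level, and g4 is clocked as soon as its own slowest input is ready.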

III. Methodology and Tools for CD Domino Circuits 

While synthesis and design tools have been developed for standard 
domino, they operate with the constraint of using dual-rail or differential
domino logic [13]-[15]. By providing any logic function, CD
domino gates can be used by design tools developed for static CMOS 
gates by simply replacing the static CMOS library with a CD domino 
library. After logic synthesis or handcrafted design with a tool designed 
for static circuits, delay elements are inserted into the circuit netlist to 
provide the correct self-timed delays, and a proper precharge-evalua- 
tion clock. Thus, combinational logic blocks can be designed having 
the speed of dynamic logic using CD domino, but without the func- 
tional limitations or higher area and power costs found in other dy- 
namic logic families. 

A design methodology developed for CD domino circuits is shown 
in Fig. 8. Cell and delay element library design has been described in 
earlier sections. Library characterization is self-explanatory. The other 
steps are described as follows. 

A. Gate Netlist 

The methodology allows the gate netlist to be handcrafted for custom
designs, or generated by a synthesis tool. Any synthesis tool developed
for static CMOS circuits can be used for this step. The library given to
the synthesis tool is still "static" as far as the tool is concerned. The key
difference is that each gate in the "static" library corresponds directly
to a gate in the dynamic library and has the same functionality. Thus, at
this point, the netlist is still a static CMOS netlist. Care must be taken
to allow the synthesis tool to fully take advantage of the fast high fanin
dynamic gates available in the library.

Fig. 6. Simplest clocking scheme for CD domino.

Fig. 7. General clocking scheme for CD domino.

Fig. 8. Design methodology for CD domino.



levelize_netlist()
    set flops to gate_level = 0
    for each gate with primary inputs or flop inputs
        gate_level = 1
    for i = 1 to maximum_gate_level
        for each gate
            if all inputs are from <= level i
                gate_level = i + 1

timing_analysis()
    levelize_netlist()
    for gates at level i = 1 to max_gate_level
        if i = 1
            gate_path_delay = gate_delay
        else
            find_slowest_input()
            gate_path_delay = slowest_input->gate_path_delay + gate_delay



Fig. 9. Algorithm for gate levelization and timing analysis. 



B. Circuit Timing Analysis 

A circuit timing analysis tool was developed for the CD domino 
methodology. Using the characterized gate and delay element delay 
information in a lookup table, a static timing analysis is done on the 
netlist by levelizing the netlist (assigning a gate level to each gate), and 
adding up the path delays to each gate. The netlist levelizing and timing 
analysis algorithms are given in Fig. 9. The timing analysis tool simply 
adds up the delays of the gates along all paths starting with gates at 
the first gate level. More accurate delay estimation is made using the 
extracted layout information fed back to the timing analysis tool. As 
a final check, the tool also writes out a critical path SPICE simulation
with the extracted parasitics, delay elements, and gates.
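
The levelization and path-delay accumulation of Fig. 9 can be sketched in runnable form as follows. This is an illustrative reading of the figure, not the authors' tool: the data layout (a dictionary from gate name to input names), the function names, and the delay values are assumptions, and each gate is assigned a level only once, at one more than the highest level among its inputs.

```python
from typing import Dict, List, Optional, Tuple


def levelize_netlist(inputs_of: Dict[str, List[str]]) -> Dict[str, int]:
    """Assign a gate level to each gate; primary inputs and flops are level 0."""
    level: Dict[str, int] = {}

    def source_level(name: str) -> Optional[int]:
        # Names that are not gates (primary inputs, flops) sit at level 0.
        return 0 if name not in inputs_of else level.get(name)

    i = 0
    while len(level) < len(inputs_of):
        ready = [
            gate for gate, ins in inputs_of.items()
            if gate not in level
            and all(source_level(n) is not None and source_level(n) <= i for n in ins)
        ]
        if not ready:
            raise ValueError("netlist contains a combinational loop")
        for gate in ready:
            level[gate] = i + 1
        i += 1
    return level


def timing_analysis(
    inputs_of: Dict[str, List[str]], gate_delay: Dict[str, float]
) -> Tuple[Dict[str, float], Dict[str, Optional[str]]]:
    """Accumulate the path delay to each gate through its slowest input."""
    level = levelize_netlist(inputs_of)
    path_delay: Dict[str, float] = {}
    slowest_input: Dict[str, Optional[str]] = {}
    for gate in sorted(inputs_of, key=lambda g: level[g]):
        gate_inputs = [n for n in inputs_of[gate] if n in inputs_of]
        if not gate_inputs:  # first-level gate: only primary or flop inputs
            slowest_input[gate] = None
            path_delay[gate] = gate_delay[gate]
        else:
            slow = max(gate_inputs, key=lambda n: path_delay[n])
            slowest_input[gate] = slow
            path_delay[gate] = path_delay[slow] + gate_delay[gate]
    return path_delay, slowest_input


def critical_path(path_delay, slowest_input) -> List[str]:
    """Trace back from the gate with the largest accumulated delay."""
    gate: Optional[str] = max(path_delay, key=lambda g: path_delay[g])
    path: List[str] = []
    while gate is not None:
        path.append(gate)
        gate = slowest_input[gate]
    return path[::-1]


inputs_of = {"g1": ["a", "b"], "g2": ["c"], "g3": ["g2"], "g4": ["g1", "g3"]}
gate_delay = {"g1": 1.0, "g2": 0.2, "g3": 0.3, "g4": 0.4}  # hypothetical ns values
path_delay, slowest_input = timing_analysis(inputs_of, gate_delay)
print(critical_path(path_delay, slowest_input), path_delay["g4"])
```

A lookup table of characterized delays, as described in the text, would take the place of the hard-coded `gate_delay` dictionary.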

Fig. 10. Block diagram of static and domino 32-b CLA adder.

Fig. 11. Block diagram of CD domino adder.

C. Delay Element Insertion and Clock Assignment 

The timing analysis data is provided to the delay element insertion 
tool to generate the clocking network depending on the user option for 
the simplest clocking scheme or the general scheme described in Section II.
The tool also pads out short paths with dynamic buffers (1-input
OR) to ensure enough evaluation time overlap between delayed clocks.
For example, a problem can arise if a gate with clk1 drives a gate with
clk9 and they do not have any overlap in their clocks' evaluation phase.
The number of delayed clock phases that a signal can skip is a user 
input. 

Once the delay elements have been chosen according to the clocking 
scheme, the delayed clock phases are assigned to each gate. For the simplest
clocking scheme, clock assignment is determined by each gate's
gate level, as determined by levelize_netlist(). For the general clocking
scheme, each gate gets its clock from its slowest input, which will have 
a delay element associated with it. The static netlist is now converted 
to a dynamic gate netlist by directly mapping the gates to a dynamic 
library, and assigning the clocks according to the user defined clocking 
scheme. 
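
The short-path padding step can be sketched as below. The text does not give the padding rule, so this is an assumption: a signal may skip at most `max_skip` delayed clock phases per hop, and a dynamic buffer is inserted wherever that limit would be exceeded. The function names are hypothetical.

```python
import math
from typing import List


def buffers_needed(driver_phase: int, receiver_phase: int, max_skip: int) -> int:
    """Number of 1-input OR buffers to insert between two clock phases.

    With max_skip phases allowed to be skipped, one hop may advance the
    signal by at most max_skip + 1 phases.
    """
    gap = receiver_phase - driver_phase
    if gap < 1:
        raise ValueError("receiver must be clocked by a later phase than its driver")
    if max_skip < 0:
        raise ValueError("max_skip must be non-negative")
    hops = math.ceil(gap / (max_skip + 1))
    return hops - 1


def buffer_phases(driver_phase: int, receiver_phase: int, max_skip: int) -> List[int]:
    """Clock phases assigned to the inserted buffers, in signal order."""
    step = max_skip + 1
    count = buffers_needed(driver_phase, receiver_phase, max_skip)
    return [driver_phase + step * (k + 1) for k in range(count)]


# The clk1 -> clk9 case from the text, allowing one skipped phase per hop.
print(buffers_needed(1, 9, max_skip=1), buffer_phases(1, 9, max_skip=1))
```

Under this assumed rule, the clk1 to clk9 connection is padded with buffers on clk3, clk5, and clk7, so that every hop skips exactly one phase.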

D. Placement and Routing 

Placement and routing of the design is done using the standard cell 
library and dynamic gate netlist. The skew between each clock phase is
reduced by restricting the placement, which is done using industry placement
tools. For example, with the simplest clocking scheme, gates at
the same level are placed in the same rows to reduce the wire length
of the clock net. After extraction, the parasitic data is fed back to the
timing analysis tool for more accurate delay calculation.







Fig. 12. Schematic of dynamic 2-input XNOR. 

IV. Adder Design 

The carry lookahead (CLA) adder design was chosen to demonstrate 
the use of CD domino gates and the design methodology described in 
Section III. 

A. Static and Dual-Rail Adder Design 

When using static CMOS or standard domino logic, a common 32-b 
CLA adder design uses eight 4-b full adder (FA) blocks with two CLA 
logic levels, as shown in Fig. 10. The CLA FA logic equations for the
first level are given by (1), while (2) and (3) provide the CLA logic equations
for the second and third levels [16]:

\[
\begin{aligned}
g_i &= a_i b_i, \quad p_i = a_i \oplus b_i, \quad s_i = p_i \oplus c_i \\
c_0 &= c_{in}, \quad c_1 = g_0 + p_0 c_{in} \\
c_2 &= g_1 + p_1 g_0 + p_1 p_0 c_{in} \\
c_3 &= g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 c_{in}
\end{aligned}
\tag{1}
\]

TABLE I
Simulated Adder Comparisons in MOSIS' 0.8-µm Technology

32-bit adder type    Simulated worst case delay [ns]    Adder core area [mm^2]    Avg. power [W]    Power-delay product [W·ns]
Static CMOS          6.97                               0.565                     0.0756            0.527
Dual-rail            4.23                               0.923                     0.1994            0.845
CD domino            2.73                               0.858                     0.1733            0.473


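\[
\begin{aligned}
P_0 &= p_3 p_2 p_1 p_0, \ \cdots, \ P_3 = p_{15} p_{14} p_{13} p_{12} \\
G_0 &= g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0, \ \cdots \\
G_3 &= g_{15} + p_{15} g_{14} + p_{15} p_{14} g_{13} + p_{15} p_{14} p_{13} g_{12} \\
C_1 &= G_0 + P_0 c_{in}, \quad C_2 = G_1 + P_1 G_0 + P_1 P_0 c_{in} \\
C_3 &= G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 c_{in}
\end{aligned}
\tag{2}
\]

\[
\begin{aligned}
P_0^{*} &= P_3 P_2 P_1 P_0, \quad P_1^{*} = P_7 P_6 P_5 P_4 \\
G_0^{*} &= G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 \\
G_1^{*} &= G_7 + P_7 G_6 + P_7 P_6 G_5 + P_7 P_6 P_5 G_4 \\
C_1^{*} &= G_0^{*} + P_0^{*} c_{in}, \quad C_2^{*} = G_1^{*} + P_1^{*} G_0^{*} + P_1^{*} P_0^{*} c_{in}.
\end{aligned}
\tag{3}
\]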

The design in Fig. 10 uses 4-b FA blocks because of the speed limi- 
tations imposed by the number of series transistors required for higher 
carry bit logic in standard domino and static CMOS logic. In a purely 
standard domino CLA design, the dual-rail approach must be used to 
provide both polarities of the inputs needed by the XOR gates (s_i and
p_i functions). The duals of (1)-(3) are not shown since they can be easily
derived by inverting all inputs, swapping AND gates for OR gates,
XNOR gates for XOR gates, and vice versa.
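
As a sanity check on (1) and on the dual construction just described, the following sketch compares the lookahead carries with a ripple-carry reference and confirms that the dual network (inverted inputs, AND and OR swapped, XNOR in place of XOR) produces the complemented carries. The function names are hypothetical, and the check covers a single 4-b block only.

```python
from itertools import product
from typing import List, Sequence


def ripple_carries(a: Sequence[int], b: Sequence[int], c_in: int) -> List[int]:
    """Reference carries c0..c3 computed bit by bit."""
    carries = [c_in]
    for x, y in zip(a, b):
        carries.append((x & y) | ((x ^ y) & carries[-1]))
    return carries[:4]


def lookahead_carries(a: Sequence[int], b: Sequence[int], c_in: int) -> List[int]:
    """Carries c0..c3 written in the two-level AND-OR form of (1)."""
    g = [x & y for x, y in zip(a, b)]
    p = [x ^ y for x, y in zip(a, b)]
    c1 = g[0] | (p[0] & c_in)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c_in)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c_in)
    return [c_in, c1, c2, c3]


def dual_carries(na: Sequence[int], nb: Sequence[int], nc_in: int) -> List[int]:
    """Dual of (1): takes inverted inputs and returns inverted carries."""
    ng = [x | y for x, y in zip(na, nb)]        # OR replaces AND
    np_ = [1 - (x ^ y) for x, y in zip(na, nb)]  # XNOR replaces XOR
    c1 = ng[0] & (np_[0] | nc_in)
    c2 = ng[1] & (np_[1] | ng[0]) & (np_[1] | np_[0] | nc_in)
    c3 = (ng[2] & (np_[2] | ng[1]) & (np_[2] | np_[1] | ng[0])
          & (np_[2] | np_[1] | np_[0] | nc_in))
    return [nc_in, c1, c2, c3]


def check_all() -> bool:
    """Exhaustively test all 4-b operand pairs and both carry-in values."""
    for bits in product((0, 1), repeat=9):
        a, b, c_in = bits[0:4], bits[4:8], bits[8]
        carries = lookahead_carries(a, b, c_in)
        if carries != ripple_carries(a, b, c_in):
            return False
        inverted = dual_carries([1 - x for x in a], [1 - y for y in b], 1 - c_in)
        if inverted != [1 - c for c in carries]:
            return False
    return True


print(check_all())
```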

B. CD Domino Adder Design 

Using CD domino gates, a CLA adder was designed to demonstrate 
the use of inverting and fast high fanin dynamic gates. The adder uses 
eight 4-b FA blocks with a single level of CLA logic, as shown in 
Fig. 11. The 8-b CLA logic block provides the lookahead carry values
(C_i) for the eight FA blocks. The CLA FA logic equations are given
by (4). They are the same as in (1) except inverted, using DeMorgan's
law, to take advantage of wide NOR gates instead of AND gates. Likewise,
the 8-b CLA logic block equations are given by (5) and are also
inverted to use wide NOR gates. Although inputs and intermediate
nodes are all inverted, the final sum outputs are not inverted, and both
polarities of g_i and G_i are provided with static inverters. An 8-b CLA
logic block is practical because of the fast high fanin OR and NOR gates
possible with dynamic logic. This feature of CD domino improves the
performance of the adder design, but would not be practical with static
CMOS or dual-rail domino.

\[
\begin{aligned}
\bar{g}_i &= \bar{a}_i + \bar{b}_i, \quad \bar{p}_i = \bar{a}_i \odot \bar{b}_i, \quad s_i = \bar{p}_i \oplus \bar{c}_i \\
\bar{c}_0 &= \bar{c}_{in}, \quad \bar{c}_1 = \overline{g_0 + \overline{(\bar{p}_0 + \bar{c}_{in})}}, \ \cdots \\
\bar{c}_3 &= \overline{g_2 + \overline{(\bar{p}_2 + \bar{g}_1)} + \overline{(\bar{p}_2 + \bar{p}_1 + \bar{g}_0)} + \overline{(\bar{p}_2 + \bar{p}_1 + \bar{p}_0 + \bar{c}_{in})}}
\end{aligned}
\tag{4}
\]

\[
\begin{aligned}
\bar{P}_0 &= \bar{p}_3 + \bar{p}_2 + \bar{p}_1 + \bar{p}_0, \ \cdots \\
\bar{P}_7 &= \bar{p}_{31} + \bar{p}_{30} + \bar{p}_{29} + \bar{p}_{28} \\
\bar{G}_0 &= \overline{g_3 + \overline{(\bar{p}_3 + \bar{g}_2)} + \cdots + \overline{(\bar{p}_3 + \bar{p}_2 + \bar{p}_1 + \bar{g}_0)}}, \ \cdots \\
\bar{G}_7 &= \overline{g_{31} + \overline{(\bar{p}_{31} + \bar{g}_{30})} + \cdots + \overline{(\bar{p}_{31} + \bar{p}_{30} + \bar{p}_{29} + \bar{g}_{28})}} \\
\bar{C}_1 &= \overline{G_0 + \overline{(\bar{P}_0 + \bar{c}_{in})}}, \ \cdots \\
\bar{C}_8 &= \overline{G_7 + \overline{(\bar{P}_7 + \bar{G}_6)} + \cdots + \overline{(\bar{P}_7 + \bar{P}_6 + \cdots + \bar{P}_0 + \bar{c}_{in})}}.
\end{aligned}
\tag{5}
\]

Fig. 13. CD domino adder chip layout.

Because inverting and noninverting gates can be used together,
compact dynamic XOR and XNOR gates are used in the design
without requiring a dual-rail approach for CD domino. Fig. 12 shows
the schematic for the dynamic XNOR gate used in the dynamic adder
designs.

The performance improvements obtained using CD domino are primarily
due to the reduction in the number of gate levels. Using high
fanin NOR and OR gates allows the collapsing of logic cones into
fewer levels. In more conventional logic families such as static CMOS
or dual-rail domino, it is not possible to use very wide gates (such as
eight-input NORs and ORs) due to the nominal limit on the number of
series transistors (typically three or four, and even at four the edge rates
are degraded). Significant reductions in the number of gate levels translate
to large performance improvements even though a 20% margin is
designed into the delay elements. For the CD domino adder chip, programmable
delay elements were used to allow adjustable delays after
fabrication, as shown in Fig. 5. A global external input voltage, Vctl, is
routed to all delay elements to control the delay through the delay elements
and increase or decrease the margin. For all of the work reported
here, Vctl was set at VDD to give a nominal margin of at least 20%.
Reducing Vctl below VDD will only slow down the delay element and
increase the margin, if needed.

V. Adder Results

Static CMOS and standard domino versions of the 32-b CLA adders,
described in the previous section, were designed and compared to the
CD domino adder. The dual-rail domino and CD domino adders were

Fig. 14. Circuit used to measure adder delay within chip core.

TABLE II
Measured Adder Results from MOSIS Adder Chips

32-bit adder type    Measured worst case delay [ns]    Adder core area [mm^2]    Delay-area [ns·mm^2]
Dual-rail domino     3.4                               0.923                     3.14
CD domino            2.1                               0.858                     1.80



designed using the same standard cell library of dynamic gates. A separate
library was developed for the static adder. The designs were routed
with three metal layers using MOSIS' scalable CMOS 0.8-µm process,
which allows 1.0-µm drawn gate lengths. Both libraries were developed
with speed and area considerations. Table I compares the simulated
results from extracted netlists for the worst case 32-b addition, the
total area of the adder layout cores, and the power-delay product. The
CD domino adder is 1.6× faster than the dual-rail domino design, while
requiring 8% less area. The simulation and measured delay results reported
for CD domino have the margin of at least 20% designed into
the delay elements. The CD domino adder had a power-delay product
better than the static adder by over 10%. The CD domino adder chip is
shown in Fig. 13. The circuit diagram shown in Fig. 14 was used to obtain
the worst case adder delays within the chip core. Table II provides
the measured results for the dual-rail and CD domino adders.
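
The derived figures in Tables I and II can be recomputed from the tabulated delay, area, and power entries, as in the short sketch below. The variable names are hypothetical; small differences from the printed values are to be expected from rounding of the table entries, and the area saving comes out between 7% and 8% depending on which adder is taken as the baseline.

```python
# (delay [ns], core area [mm^2], average power [W]) from Table I.
simulated = {
    "Static CMOS": (6.97, 0.565, 0.0756),
    "Dual-rail": (4.23, 0.923, 0.1994),
    "CD domino": (2.73, 0.858, 0.1733),
}
# (delay [ns], core area [mm^2]) from Table II.
measured = {
    "Dual-rail domino": (3.4, 0.923),
    "CD domino": (2.1, 0.858),
}

# Power-delay product for each simulated adder.
power_delay = {name: d * p for name, (d, _a, p) in simulated.items()}

# Delay-area product for each measured adder.
delay_area = {name: d * a for name, (d, a) in measured.items()}

# Speed of CD domino relative to dual-rail domino.
measured_speedup = measured["Dual-rail domino"][0] / measured["CD domino"][0]
simulated_speedup = simulated["Dual-rail"][0] / simulated["CD domino"][0]

# Area saving of CD domino, with the dual-rail adder as the baseline.
area_saving = 1 - measured["CD domino"][1] / measured["Dual-rail domino"][1]

# Power-delay improvement of CD domino over static CMOS.
pdp_gain = 1 - power_delay["CD domino"] / power_delay["Static CMOS"]

print(f"measured speedup:  {measured_speedup:.2f}x")
print(f"simulated speedup: {simulated_speedup:.2f}x")
print(f"area saving:       {area_saving:.1%}")
print(f"power-delay gain:  {pdp_gain:.1%}")
```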

VI. Conclusion 

A self-timed dynamic logic family, an automated methodology, and 
tools for designing dynamic circuits with it were presented. CD domino 
provides any logic function allowed by static CMOS, as well as fast 
high fanin gates that are not possible with static circuits or standard 
domino logic. As the adder comparisons in the previous section show, 
the CD domino implementation of the 32-b CLA adder has significant 
speed improvements over the static CMOS and dual-rail domino counterparts.
Measurements from fabricated chips show the CD domino
adder to be more than 1.6× faster than the dual-rail domino adder using
the same gate library, and requiring 8% less area.

The performance and functionality advantages are possible because 
CD domino provides fast high fanin dynamic gates while maintaining 
the flexibility in design of static CMOS. The use of higher fanin gates 
does not hurt the speed of each gate and improves overall performance 
by reducing the number of logic levels in the critical path. Just as im- 
portant is the significant reduction in the design time of the self-timed 
dynamic circuits made possible by the methodology and design au- 
tomation tools developed for CD domino. Fast and robust CD domino 
circuits can be designed with practical delay element margins. 